<html>
<head>
<link rel="Stylesheet" type="text/css" href="../../DocStyle.css" />
<title>Vczh Library++ Regular Expression</title>
</head>
<body>
<h1>namespace regex;</h1>
<h2><a name="RegexString">RegexString</a></h2>
<p>RegexString represents a part of the input string.</p>
<p>See <a href="#Grammar">Regular expression grammar reference</a> for more information.</p>
<table>
<thead><tr><td colspan="2" style="text-align:center; font-weight:bold">Member functions</td></tr>
<tr><td style="width:100">Parameter</td><td>Description</td></tr></thead>
<tbody>
<tr><td colspan="2" class="method_sign">int Start()const</td></tr>
<tr><td>[result]</td><td>The start position in the input string.</td></tr>

<tr><td colspan="2" class="method_sign">int Length()const</td></tr>
<tr><td>[result]</td><td>Character count of the string.</td></tr>

<tr><td colspan="2" class="method_sign">const WString& Value()const</td></tr>
<tr><td>[result]</td><td>Get a WString representing the string.</td></tr>

<tr><td colspan="2" class="method_sign">bool operator==(const RegexString& string)const</td></tr>
<tr><td>[result]</td><td>The comparison result.</td></tr>
<tr><td>string</td><td>The string to compare.</td></tr>
</tbody>
</table>
<h2><a name="RegexMatch">RegexMatch</a></h2>
<p>RegexMatch represents a match of a regular expression.</p>
<p>See <a href="#Grammar">Regular expression grammar reference</a> for more information.</p>
<table>
<thead><tr><td colspan="2" style="text-align:center; font-weight:bold">Member functions</td></tr>
<tr><td style="width:100">Parameter</td><td>Description</td></tr></thead>
<tbody>
<tr><td colspan="2" class="method_sign">bool Success()const</td></tr>
<tr><td>[result]</td><td>True indicates this is a successful match, otherwise false.</td></tr>

<tr><td colspan="2" class="method_sign">const RegexString& Result()const</td></tr>
<tr><td>[result]</td><td>The whole string of the match.</td></tr>

<tr><td colspan="2" class="method_sign">collections::IReadonlyList&lt;RegexString&gt; Captures()const</td></tr>
<tr><td>[result]</td><td>Get a list for all anonymous captures.</td></tr>

<tr><td colspan="2" class="method_sign">collections::IReadonlyGroup&lt;WString, RegexString&gt; Groups()const</td></tr>
<tr><td>[result]</td><td>Get a group for all named captures. Regex allows multiple captures under a name.</td></tr>
</tbody>
</table>
<h2><a name="Regex">Regex</a></h2>
<p>Regex is the regular expression engine.</p>
<p>See <a href="#Grammar">Regular expression grammar reference</a> for more information.</p>
<table>
<thead><tr><td colspan="2" style="text-align:center; font-weight:bold">Member functions</td></tr>
<tr><td style="width:100">Parameter</td><td>Description</td></tr></thead>
<tbody>
<tr><td colspan="2" class="method_sign">Regex(const WString& code, bool preferPure=true)</td></tr>
<tr><td>code</td><td>A string that represents a regular expression.</td></tr>
<tr><td>preferPure</td><td>If preferPure is true, Regex analyzes the regular expression and uses a much faster algorithm when the expression has no back references to captures, no lookahead assertions and no non-greedy loops.</td></tr>

<tr><td colspan="2" class="method_sign">bool IsPureMatch()const</td></tr>
<tr><td>[result]</td><td>True indicates that Regex uses the faster algorithm for calculating a match (with capture information).</td></tr>

<tr><td colspan="2" class="method_sign">bool IsPureTest()const</td></tr>
<tr><td>[result]</td><td>True indicates that Regex uses the faster algorithm for searching for a match only (without capture information).</td></tr>

<tr><td colspan="2" class="method_sign">Ptr&lt;RegexMatch&gt; MatchHead(const WString& text)const</td></tr>
<tr><td>[result]</td><td>A match which begins at the first character of the input string.</td></tr>
<tr><td>text</td><td>The input string.</td></tr>

<tr><td colspan="2" class="method_sign">Ptr&lt;RegexMatch&gt; Match(const WString& text)const</td></tr>
<tr><td>[result]</td><td>A match in the input string.</td></tr>
<tr><td>text</td><td>The input string.</td></tr>

<tr><td colspan="2" class="method_sign">bool TestHead(const WString& text)const</td></tr>
<tr><td>[result]</td><td>True indicates there is a match which begins at the first character of the input string, otherwise false.</td></tr>
<tr><td>text</td><td>The input string.</td></tr>

<tr><td colspan="2" class="method_sign">bool TestBool(const WString& text)const</td></tr>
<tr><td>[result]</td><td>True indicates there is a match in the input string, otherwise false.</td></tr>
<tr><td>text</td><td>The input string.</td></tr>

<tr><td colspan="2" class="method_sign">void Search(const WString& text, RegexMatch::List& matches)const</td></tr>
<tr><td>text</td><td>The input string.</td></tr>
<tr><td>matches</td><td>All matches in the input string.</td></tr>

<tr><td colspan="2" class="method_sign">void Split(const WString& text, bool keepEmptyMatch, RegexMatch::List& matches)const</td></tr>
<tr><td>text</td><td>The input string.</td></tr>
<tr><td>keepEmptyMatch</td><td>True to keep empty non-match parts, false to discard them.</td></tr>
<tr><td>matches</td><td>All non-match parts in the input string which are separated by the matches.</td></tr>

<tr><td colspan="2" class="method_sign">void Cut(const WString& text, bool keepEmptyMatch, RegexMatch::List& matches)const</td></tr>
<tr><td>text</td><td>The input string.</td></tr>
<tr><td>keepEmptyMatch</td><td>True to keep empty non-match parts, false to discard them.</td></tr>
<tr><td>matches</td><td>All match parts and non-match parts in the input string which are separated by the matches.</td></tr>
</tbody>
</table>
<h2><a name="RegexToken">RegexToken</a></h2>
<p>RegexToken records information about a token.</p>
<p>See <a href="#Grammar">Regular expression grammar reference</a> for more information.</p>
<table>
<thead><tr><td colspan="2" style="text-align:center; font-weight:bold">Member Variables</td></tr>
<tr><td style="width:100">Parameter</td><td>Description</td></tr></thead>
<tbody>
<tr><td colspan="2" class="method_sign">int start;</td></tr>
<tr><td>[result]</td><td>The position of the token in the input string.</td></tr>

<tr><td colspan="2" class="method_sign">int length;</td></tr>
<tr><td>[result]</td><td>The length of the token.</td></tr>

<tr><td colspan="2" class="method_sign">int token;</td></tr>
<tr><td>[result]</td><td>The index of the token type. The index is the position of the regex description in the regex list which is passed into the constructor of RegexLexer.</td></tr>

<tr><td colspan="2" class="method_sign">const wchar_t* reading;</td></tr>
<tr><td>[result]</td><td>A pointer to the position of the token in the input string.</td></tr>

<tr><td colspan="2" class="method_sign">int lineIndex;</td></tr>
<tr><td>[result]</td><td>The line index of the token.</td></tr>

<tr><td colspan="2" class="method_sign">int lineStart;</td></tr>
<tr><td>[result]</td><td>The position of the token in the current line.</td></tr>

<tr><td colspan="2" class="method_sign">int codeIndex;</td></tr>
<tr><td>[result]</td><td>The identifier of the input string. The identifier is the codeIndex argument passed to the RegexLexer::Parse function.</td></tr>
</tbody>
</table>
<h2><a name="RegexTokens">RegexTokens</a></h2>
<p>RegexTokens contains all tokens of a string.</p>
<p>See <a href="#Grammar">Regular expression grammar reference</a> for more information.</p>
<table>
<thead><tr><td colspan="2" style="text-align:center; font-weight:bold">Member functions</td></tr>
<tr><td style="width:100">Parameter</td><td>Description</td></tr></thead>
<tbody>
<tr><td colspan="2" class="method_sign">collections::IEnumerator&lt;RegexToken&gt;* CreateEnumerator()const;</td></tr>
<tr><td>[result]</td><td>A token enumerator. The lexical analyzer begins to parse a token when the Next function of the enumerator is invoked.</td></tr>

<tr><td colspan="2" class="method_sign">void ReadToEnd(collections::List&lt;RegexToken&gt;& tokens, bool(*discard)(int)=0)const;</td></tr>
<tr><td>tokens</td><td>A list to receive all tokens.</td></tr>
<tr><td>discard</td><td>A filter function that receives a token type index (RegexToken::token). It returns true if the token should be filtered out and false if the token should be kept.</td></tr>
</tbody>
</table>
<h2><a name="RegexLexer">RegexLexer</a></h2>
<p>RegexLexer is a lexical analyzer.</p>
<p>See <a href="#Grammar">Regular expression grammar reference</a> for more information.</p>
<table>
<thead><tr><td colspan="2" style="text-align:center; font-weight:bold">Member functions</td></tr>
<tr><td style="width:100">Parameter</td><td>Description</td></tr></thead>
<tbody>
<tr><td colspan="2" class="method_sign">RegexLexer(const collections::IEnumerable&lt;WString&gt;& tokens);</td></tr>
<tr><td>tokens</td><td>Token type description list. A token type is described by using a regular expression. The token type of a token is stored in RegexToken::token.</td></tr>

<tr><td colspan="2" class="method_sign">RegexTokens Parse(const WString& code, int codeIndex=-1);</td></tr>
<tr><td>code</td><td>The string to be parsed.</td></tr>
<tr><td>codeIndex</td><td>User data for external record.</td></tr>
<tr><td>[result]</td><td>Token list</td></tr>
</tbody>
</table>
<h2><a name="Grammar">Regular expression grammar reference</a></h2>
<p>Regular expressions in the Vczh Library++ Core Library have the following components:</p>
<ul>
<li><a href="#Char">Character and Escaping</a></li>
<li><a href="#Loop">Looping</a></li>
<li><a href="#Seq">Sequence and Choice</a></li>
<li><a href="#Capture">Capturing</a></li>
<li><a href="#Back">Back reference</a></li>
<li><a href="#Lookahead">Positive/Negative lookahead assertion</a></li>
<li><a href="#Rename">Sub expression renaming</a></li>
</ul>
<p>The constructor for Regex has a preferPure parameter. When preferPure is true, Regex will use a much faster algorithm if possible. The following table shows how Regex makes its decision for matching and testing.</p>
<table>
<thead><tr><td>Component</td><td>Fast matching</td><td>Fast testing</td></tr></thead>
<tbody>
<tr><td>character set</td><td>Yes</td><td>Yes</td></tr>
<tr><td>x{a}, x{a,}, x{a,b}, x+, x*, x?</td><td>Yes</td><td>Yes</td></tr>
<tr><td>ab, a|b</td><td>Yes</td><td>Yes</td></tr>
<tr><td>(&lt;name&gt;x), (?x)</td><td>No</td><td>Yes</td></tr>
<tr><td>$, ^</td><td>No</td><td>No</td></tr>
<tr><td>x{a}?, x{a,}?, x{a,b}?, x+?, x*?, x??</td><td>No</td><td>No</td></tr>
<tr><td>(&lt;$name&gt;), (&lt;$name;i&gt;), (&lt;$i&gt;)</td><td>No</td><td>No</td></tr>
<tr><td>(=x), (!x)</td><td>No</td><td>No</td></tr>
</tbody>
</table>
<p><b>In the following guide, we will use /x/ for a regular expression and "x" for a text.</b></p>
<h3><a name="Char">Character and Escaping</a></h3>
<p>There are two kinds of character declarations in a regular expression: character and character range. The escaping rules for these two declarations differ from each other.</p>
<p>We can use /a/ for "a", /b/ for "b" and /\w/ for letters, digits and underscore. But /^/ and /$/ do not mean "^" and "$". The following table shows which characters should be escaped.</p>
<table>
<thead><tr><td>Character</td><td>Meaning</td><td>Escaping 1</td><td>Escaping 2</td><td>Meaning for escaping</td></tr></thead>
<tbody>
<tr><td>.</td><td>&quot;.&quot;</td><td>\.</td><td>/.</td><td>any character</td></tr>
<tr><td>r</td><td>&quot;r&quot;</td><td>\r</td><td>/r&nbsp;</td><td>0x0D(ASCII)</td></tr>
<tr><td>n</td><td>&quot;n&quot;</td><td>\n</td><td>/n</td><td>0x0A(ASCII)</td></tr>
<tr><td>t</td><td>&quot;t&quot;</td><td>\t</td><td>/t</td><td>tab</td></tr>
<tr><td>\</td><td>&nbsp;</td><td>\\</td><td>/\</td><td>&quot;\&quot;</td></tr>
<tr><td>/</td><td>&nbsp;</td><td>\/</td><td>//</td><td>&quot;/&quot;</td></tr>
<tr><td>(</td><td>&nbsp;</td><td>\(</td><td>/(</td><td>&quot;(&quot;</td></tr>
<tr><td>)</td><td>&nbsp;</td><td>\)</td><td>/)</td><td>&quot;)&quot;</td></tr>
<tr><td>+</td><td>&nbsp;</td><td>\+</td><td>/+</td><td>&quot;+&quot;</td></tr>
<tr><td>*</td><td>&nbsp;</td><td>\*</td><td>/*</td><td>&quot;*&quot;</td></tr>
<tr><td>?</td><td>&nbsp;</td><td>\?</td><td>/?</td><td>&quot;?&quot;</td></tr>
<tr><td>{</td><td>&nbsp;</td><td>\{</td><td>/{</td><td>&quot;{&quot;</td></tr>
<tr><td>}</td><td>&nbsp;</td><td>\}</td><td>/}</td><td>&quot;}&quot;</td></tr>
<tr><td>[</td><td>&nbsp;</td><td>\[</td><td>/[</td><td>&quot;[&quot;</td></tr>
<tr><td>]</td><td>&nbsp;</td><td>\]</td><td>/]</td><td>&quot;]&quot;</td></tr>
<tr><td>&lt;</td><td>&nbsp;</td><td>\&lt;</td><td>/&lt;</td><td>&quot;&lt;&quot;</td></tr>
<tr><td>&gt;</td><td>&nbsp;</td><td>\&gt;</td><td>/&gt;</td><td>&quot;&gt;&quot;</td></tr>
<tr><td>^</td><td>Text beginning</td><td>\^</td><td>/^</td><td>&quot;^&quot;</td></tr>
<tr><td>$</td><td>Text ending</td><td>\$</td><td>/$</td><td>&quot;$&quot;</td></tr>
<tr><td>!</td><td>&nbsp;</td><td>\!</td><td>/!</td><td>&quot;!&quot;</td></tr>
<tr><td>=</td><td>&nbsp;</td><td>\=</td><td>/=</td><td>&quot;=&quot;</td></tr>
<tr><td>S</td><td>&quot;S&quot;</td><td>\S</td><td>/S</td><td>not /s</td></tr>
<tr><td>s</td><td>&quot;s&quot;</td><td>\s</td><td>/s</td><td>space, 0x0D, 0x0A, tab</td></tr>
<tr><td>D</td><td>&quot;D&quot;</td><td>\D</td><td>/D</td><td>not /d</td></tr>
<tr><td>d</td><td>&quot;d&quot;</td><td>\d</td><td>/d</td><td>digit</td></tr>
<tr><td>L</td><td>&quot;L&quot;</td><td>\L</td><td>/L</td><td>not /l</td></tr>
<tr><td>l</td><td>&quot;l&quot;</td><td>\l</td><td>/l</td><td>letter, digit</td></tr>
<tr><td>W</td><td>&quot;W&quot;</td><td>\W</td><td>/W</td><td>not /w</td></tr>
<tr><td>w</td><td>&quot;w&quot;</td><td>\w</td><td>/w</td><td>letter, digit, underscore</td></tr>
</tbody>
</table>
<p>We can also use a character range for a range or a combination of multiple ranges of characters. For example, /[a-c]/ means "a", "b" or "c", and /[a-zA-Z0-9_]/ means a letter, a digit or an underscore. If you use [^...] instead of [...], it means "any character except the listed ones". For example, /[^a-zA-Z]/ means a character that is not a letter.</p>
<p>Escaping in a character range is different from escaping in a character. The following table shows which characters need to be escaped in a character range.</p>
<table>
<thead><tr><td>Character</td><td>Meaning</td><td>Escaping 1</td><td>Escaping 2</td><td>Meaning for escaping</td></tr></thead>
<tbody>
<tr><td>.</td><td>&quot;.&quot;</td><td>\.</td><td>/.</td><td>any character</td></tr>
<tr><td>r</td><td>&quot;r&quot;</td><td>\r</td><td>/r&nbsp;</td><td>0x0D(ASCII)</td></tr>
<tr><td>n</td><td>&quot;n&quot;</td><td>\n</td><td>/n</td><td>0x0A(ASCII)</td></tr>
<tr><td>t</td><td>&quot;t&quot;</td><td>\t</td><td>/t</td><td>tab</td></tr>
<tr><td>-</td><td>&nbsp;</td><td>\-</td><td>/-</td><td>&quot;-&quot;</td></tr>
<tr><td>\</td><td>&nbsp;</td><td>\\</td><td>/\</td><td>&quot;\&quot;</td></tr>
<tr><td>/</td><td>&nbsp;</td><td>\/</td><td>//</td><td>&quot;/&quot;</td></tr>
<tr><td>[</td><td>&nbsp;</td><td>\[</td><td>/[</td><td>&quot;[&quot;</td></tr>
<tr><td>]</td><td>&nbsp;</td><td>\]</td><td>/]</td><td>&quot;]&quot;</td></tr>
<tr><td>^</td><td>Text beginning</td><td>\^</td><td>/^</td><td>&quot;^&quot;</td></tr>
<tr><td>$</td><td>Text ending</td><td>\$</td><td>/$</td><td>&quot;$&quot;</td></tr>
</tbody>
</table>
<h3><a name="Loop">Looping</a></h3>
<p>Looping means repeating a pattern. There are several kinds of loops.</p>
<table>
<thead>
<tr><td>Pattern</td><td>Meaning</td></tr>
</thead>
<tbody>
<tr><td>X+</td><td>One or more X</td></tr>
<tr><td>X?</td><td>Zero or one X</td></tr>
<tr><td>X*</td><td>Zero or more X</td></tr>
<tr><td>X{3}</td><td>Three Xs</td></tr>
<tr><td>X{3,}</td><td>Three or more Xs</td></tr>
<tr><td>X{3,5}</td><td>Three, four or five Xs</td></tr>
</tbody>
</table>
<p>If a looping pattern is followed by a "?", the loop terminates as early as possible.</p>
<p>For example, /a+/ matches "aaa" in "aaa", but /a+?/ matches only "a" in "aaa". However, both /a+b/ and /a+?b/ match "aaab" in "aaab", because the loop has to continue until the "b" can be matched.</p>
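<p>The difference between the two loop forms can be reproduced with a minimal sketch. It uses std::regex from the C++ standard library (ECMAScript syntax) as a stand-in, not the Regex class documented here, and the helper name firstMatch is hypothetical. For these four patterns the syntax happens to be the same in both dialects.</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or an empty string when nothing matches.
// It relies on std::regex, not on the library's own Regex class.
std::string firstMatch(const std::string& pattern, const std::string& text)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
    {
        return m.str(0);
    }
    return "";
}
```

<p>With this helper, firstMatch("a+", "aaa") yields "aaa" while firstMatch("a+?", "aaa") yields "a". Both firstMatch("a+b", "aaab") and firstMatch("a+?b", "aaab") yield "aaab".</p>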
<h3><a name="Seq">Sequence and Choice</a></h3>
<p>A sequence of two patterns means that the second pattern must appear right after the first pattern. For example, to match an e-mail address like "vczh@163.com" or "vczh@hotmail.com", we need:</p>
<p><b>\w+@\w+.\w+</b></p>
<p>This is a sequence that combines /\w+/, /@/, /\w+/, /./ and /\w+/. These five patterns must appear one after another in the given order. When /\w+@\w+.\w+/ matches "vczh@163.com", it gets:<br />
/\w+/ = "vczh"<br />
/@/ = "@"<br />
/\w+/ = "163"<br />
/./ = "."<br />
/\w+/ = "com"</p>
<p>Choice works like an "or" operator. /a|b|c/ is equivalent to /[abc]/, and /ab|ac|de/ means "ab", "ac" or "de". Choice has lower precedence than sequence.</p>
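<p>The e-mail sequence and the choice example can be tried out with the following sketch. It is written against std::regex from the C++ standard library, not the Regex class documented here, so the literal dot has to be written as \. (in this library an unescaped . is already a literal dot). The function names splitEmail and isOneOf are hypothetical.</p>

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits an e-mail address into the five parts of the
// sequence. Each sub-pattern gets its own group so that it can be read back.
// Returns an empty vector when the whole text is not an e-mail address.
std::vector<std::string> splitEmail(const std::string& text)
{
    static const std::regex pattern(R"((\w+)(@)(\w+)(\.)(\w+))");
    std::vector<std::string> parts;
    std::smatch m;
    if (std::regex_match(text, m, pattern))
    {
        // Group 0 is the whole match, groups 1..5 are the five sub-patterns.
        for (std::size_t i = 1; i < m.size(); ++i)
        {
            parts.push_back(m.str(i));
        }
    }
    return parts;
}

// Hypothetical helper: true when the whole text is "ab", "ac" or "de".
bool isOneOf(const std::string& text)
{
    static const std::regex choice("ab|ac|de");
    return std::regex_match(text, choice);
}
```

<p>For "vczh@163.com", splitEmail returns the parts "vczh", "@", "163", "." and "com" in that order.</p>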
<h3><a name="Capture">Capturing</a></h3>
<p>Capturing is a way to get a more detailed parsing result. Capturing enables Regex to store the sub strings that are matched by selected sub patterns, and to put them into a name-string map.</p>
<p>For example, if you want to get the user name and the server name out of an email address, you can use /(&lt;NAME&gt;\w+)@(&lt;SERVER&gt;\w+.\w+)/. After you get a RegexMatch successfully, you can use match.Groups()[L"NAME"][0] and match.Groups()[L"SERVER"][0] to get the user name and the server name.</p>
<p>You can capture several sub strings in one group.</p>
<p>The first example is /(&lt;GROUP&gt;\d+).(&lt;GROUP&gt;\d+)/. It parses a real number and stores the two parts of digits in the same group called "GROUP".</p>
<p>The second example is /((&lt;GROUP&gt;\d+),)*(&lt;GROUP&gt;\d+)/. It parses an integer list (separated by commas, like "12,345,6") and stores all integers of the list in the same group called "GROUP".</p>
<p>There is another kind of capture called an "anonymous capture". An anonymous capture is a capture with an empty name. You cannot use /(&lt;&gt;expression)/ to get an anonymous capture. The correct syntax is /(?expression)/.</p>
<p>So, we have two kinds of captures so far:<br />
Named captures: /(&lt;NAME&gt;expression)/<br />
Anonymous captures: /(?expression)/</p>
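<p>The integer list example can be approximated in standard C++ as a rough sketch. std::regex has neither named captures nor several captures per group, so this sketch first checks the whole list and then collects every integer with a std::sregex_iterator. It illustrates what ends up in "GROUP" and is not the library's own mechanism. The function name integersOf is hypothetical.</p>

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns all integers of a comma separated list such as
// "12,345,6". Returns an empty vector when the text is not such a list.
std::vector<std::string> integersOf(const std::string& text)
{
    static const std::regex list(R"((\d+,)*\d+)"); // the whole list
    static const std::regex item(R"(\d+)");        // one integer of the list
    std::vector<std::string> group;
    if (!std::regex_match(text, list))
    {
        return group;
    }
    // Every match of `item` corresponds to one capture stored under "GROUP".
    for (std::sregex_iterator it(text.begin(), text.end(), item), end; it != end; ++it)
    {
        group.push_back(it->str());
    }
    return group;
}
```

<p>For "12,345,6" the result holds "12", "345" and "6", which corresponds to the three strings stored under "GROUP".</p>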
<h3><a name="Back">Back reference</a></h3>
<p>Back reference is a way to compare a sub string with a captured sub string. There are three kinds of back references:</p>
<p>(&lt;$NAME&gt;): a sub string that appears in the specified group.</p>
<p>(&lt;$NAME:i&gt;): a sub string that is the i-th string captured by the specified group.</p>
<p>(&lt;$i&gt;): a sub string that is the i-th string captured by the anonymous group.</p>
<p>Here i should be a non-negative integer. The index starts from 0. If the index is out of range, the comparison fails.</p>
<p>Here is an example that tells whether a string consists of repetitions of another string, like "aaaaa", "abCabCabC" or "1,1,1,1,1,": /<font color="red">(?\.+?)(&lt;$0&gt;)+</font>/.<br />
Here /(?\.+?)/ first tries to capture one character and checks whether the rest of the string is a repetition of the captured string. If this fails, it clears the anonymous captures and retries with two characters, three characters and so on, until the shortest solution is found or no solution exists.</p>
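<p>The same idea can be sketched with std::regex from the C++ standard library, which is not the Regex class documented here. In ECMAScript syntax the unescaped . matches any character and back references are numbered from 1, so the pattern becomes (.+?)\1+. The function name repeatedUnit is hypothetical.</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the shortest string whose repetition (two or
// more times) forms `text`, or an empty string when there is no such string.
std::string repeatedUnit(const std::string& text)
{
    // (.+?) captures as few characters as possible,
    // \1+ requires the rest of the text to repeat that capture.
    static const std::regex pattern(R"((.+?)\1+)");
    std::smatch m;
    return std::regex_match(text, m, pattern) ? m.str(1) : std::string();
}
```

<p>For "abCabCabC" the captured unit is "abC". A text such as "abcd" is not a repetition, so the result is empty.</p>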
<h3><a name="Lookahead">Positive/Negative lookahead assertion</a></h3>
<p>A positive or negative lookahead assertion is a tool for finding a pattern whose following context satisfies a condition. For example:<br />
To find a "windows" that is followed by "2000", we use /windows(=2000)/<br />
To find a "windows" that is not followed by "2000", we use /windows(!2000)/</p>
<p>The assertion part of a lookahead assertion can be any regular expression. For example:<br />
To find an "email" that is followed by some symbols and an email address, we use /email(=\W*\w+@\w+.\w+)/<br />
</p>
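<p>The two "windows" examples can be tried with the following sketch. It uses std::regex from the C++ standard library, not the Regex class documented here, so the assertions are written in ECMAScript syntax as (?=2000) and (?!2000) instead of (=2000) and (!2000). The helper name findFirst is hypothetical.</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text matched by `pattern` in `text`,
// or an empty string when the pattern is not found.
std::string findFirst(const std::string& pattern, const std::string& text)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(pattern)))
    {
        return m.str(0);
    }
    return "";
}
```

<p>In this sketch findFirst("windows(?=2000)", "windows2000") returns only "windows": the text checked by the assertion is not part of the match.</p>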
<h3><a name="Rename">Sub expression renaming</a></h3>
<p>Sub expression renaming shortens a regular expression: a pattern that appears in many places is given a name, and the name is used instead of the pattern itself to form a larger pattern.</p>
<p>For example, to match an IPv4 address we need to tell whether a text is an integer between 0 and 255.<br />
1. We get /25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d/ and name it "BYTE".<br />
2. We get /((&lt;&amp;BYTE&gt;).){3}(&lt;&amp;BYTE&gt;)/ for an IPv4 address.<br />
3. We put the renaming and the pattern in a single regular expression and get: /(&lt;#BYTE&gt;<font color="red">25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d</font>)<font color="blue">((&lt;&amp;BYTE&gt;).){3}(&lt;&amp;BYTE&gt;)</font>/.</p>
<p>This way we don't need to write /(<font color="red">25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d</font>.){3}<font color="red">25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d</font>/ in this case.</p>
<p>A regular expression can have many renamings, but a reference to a renamed sub expression must appear after the renaming declaration.</p>
<p>For example, /(&lt;#A&gt;a)(&lt;#As&gt;(&lt;&amp;A&gt;)+)(&lt;&amp;As&gt;).(&lt;&amp;As&gt;)/ (which means /a+.a+/) is correct,<br />
but /(&lt;#As&gt;(&lt;&amp;A&gt;)+)(&lt;#A&gt;a)(&lt;&amp;As&gt;).(&lt;&amp;As&gt;)/ is not.</p>
<p>Sub expression renaming is not part of a pattern, it is only syntactic sugar. Every renaming must appear before the final pattern.</p>
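<p>Dialects without sub expression renaming can get a similar effect by composing the pattern from named strings. The following sketch does this for the IPv4 example with std::regex from the C++ standard library, not the Regex class documented here. The function name isIPv4 is hypothetical, and the BYTE pattern is wrapped in a non-capturing group (?:...) so that its alternatives stay together wherever it is reused.</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole text is a dotted IPv4 address.
bool isIPv4(const std::string& text)
{
    // "BYTE": an integer between 0 and 255.
    static const std::string byte = R"((?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d))";
    // Three times BYTE followed by a literal dot, then one more BYTE.
    static const std::regex pattern("(?:" + byte + "\\.){3}" + byte);
    return std::regex_match(text, pattern);
}
```

<p>As with the renaming syntax, the BYTE pattern is written once and referenced twice, and it has to be defined before the pattern that uses it.</p>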
</body>
</html>